A PanNET classifier was built by Marco Visani as part of his student
project (043_PanNET_classifier_student_project). For that
classifier, feature selection was performed in an unsupervised manner,
after training the random forest on all available probes.
Here, feature selection is based on prior knowledge: the same DMPs used for consensus clustering are also used as features for training the model. The expectation is that using the same probes will increase the precision of the classifier. In addition, using a fixed set of probes allows splitting the available data into a training and a test set.
The 155 PanNET samples used to define the epigenetic groups are
loaded as described in the 030_NETG1G2_EPIC script
220609_NETG1G2_EPIC_consensus_clustering_plus_metastases.
Analogous to the classifier trained by Marco, all 155 samples will be used as the training set. Model performance will then be evaluated using cross validation. The samples of the NETG1G2 cohort will serve as an independent test set.
It should be noted that the new test data are based on the EPIC technology, while the training data contain only a few EPIC samples. Technical differences may therefore impact model performance. However, all samples to be added to the classifier in the future are expected to be EPIC-based.
For the classifier trained by Marco, training was performed on the ComBat-corrected data. The same data were used for cross validation, which constitutes a potential data leak leading to overestimated model performance. Due to the low number of EPIC samples it is not possible to reliably perform ComBat batch correction within the cross-validation folds. Therefore, all training is performed on the normalized but not batch-corrected data. This allows a more realistic evaluation of model performance.
In the initial analysis, the only technical variable associated with the data was the array position. A correction for array position could be applied as part of cross validation.
In Capper 2018, feature selection was performed based on the random forest importance. In that workflow, feature selection happens outside of cross validation. This means there is data leakage between test and training data within the cross validation, leading to over-optimistic performance estimates.
The same approach was taken by Marco potentially explaining the reduced precision when predicting the NETG1G2 data compared to cross validation.
As an alternative, it is possible to select the same probes that were used for consensus clustering. This still presents a case of data leakage during cross validation, because probe selection is done once at the start of training. However, the probes are not selected directly on this data set but only on the 125 samples from Di Domenico 2020. This partially mitigates the issue of overconfidence.
To reduce redundancy in the data, highly correlated probes (Pearson’s rho > 0.8) can be removed. While correlated probes are not problematic for random forest classifiers, they pose serious problems for other machine learning models. In a random forest model, correlated features will split the feature importance measure, potentially excluding relevant probes. This is mainly important when performing feature selection based on the intrinsic random forest importance measure.
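As a sketch, this filtering could be done with `caret::findCorrelation`; the matrix name `betas` (samples × probes, normalized beta values) is an assumption for illustration:

```r
# De-correlate probes: drop one probe from every pair with |rho| > 0.8.
# `betas` (samples x probes) is an assumed name for the normalized data.
library(caret)

probe_cor <- cor(betas, method = "pearson")
to_drop   <- findCorrelation(probe_cor, cutoff = 0.8)  # column indices to remove
betas_dc  <- if (length(to_drop) > 0) betas[, -to_drop, drop = FALSE] else betas
```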
The methylation data is already normalized and no further feature engineering is required.
To combat the class imbalance, the minority classes are oversampled to match the largest class.
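A minimal sketch of this step via the themis extension of tidymodels; `train_df` and the outcome column `epi_group` are assumed names:

```r
# Upsample every minority class until it matches the largest class.
library(recipes)
library(themis)

rec <- recipe(epi_group ~ ., data = train_df) %>%
  step_upsample(epi_group, over_ratio = 1)  # over_ratio = 1: match majority class
```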
Cross validation is performed to evaluate model performance and tune model hyperparameters.
The model performance can be scored using different summary statistics.
Parameter tuning
- n_tree: 100 to 1300, 7 steps
- mtry: 50 to 100, 3 steps

For mtry there is a default value of \(\sqrt{n\_probes}\). The search space is
arranged symmetrically around this value.
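The grid above could be set up in tidymodels roughly as follows (a sketch; `train_df` and `epi_group` are assumed names, and the fold setup is illustrative):

```r
library(parsnip)
library(tune)
library(rsample)

# Random forest with tunable mtry and trees, ranger engine.
rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Regular grid matching the ranges stated above.
rf_grid <- expand.grid(
  trees = round(seq(100, 1300, length.out = 7)),
  mtry  = round(seq(50, 100, length.out = 3))
)

folds  <- vfold_cv(train_df, v = 5, strata = epi_group)
rf_res <- tune_grid(rf_spec, epi_group ~ ., resamples = folds, grid = rf_grid)
```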
The ranger random forest implementation is run including
all DMPs.
Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.
Based on the model tuning the mtry is set to 75 and n_trees to 500.
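The final model can then be fit directly with ranger (a sketch; `train_df` and `epi_group` are assumed names):

```r
library(ranger)

# Fit the tuned model: 500 trees, mtry = 75; a probability forest is needed
# for the class-probability based metrics (roc_auc, pr_auc, mn_log_loss).
rf_final <- ranger(
  epi_group ~ ., data = train_df,
  num.trees = 500, mtry = 75,
  probability = TRUE
)
```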
A second implementation of random forests in the
randomForest package is used on all DMPs.
Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.
ranger model

Grey dots indicate the estimates from individual bootstraps. In red the averages of the cross validation folds are shown and the blue dot indicates the mean over all replicates.

The randomForest implementation is markedly slower
than ranger. Therefore, ranger will be
used for further evaluations.
After tuning hyperparameters, the model performance is evaluated using the outer cross validation folds. More detailed statistics are collected to identify sources of error in the model.
Most of these metrics, with the exception of
accuracy, mcc and kap, are only
defined for binary classifications. To extend these metrics to
multiclass cases, the results can be averaged without and with weighting
by class size (macro and macro_weighted). It
is also possible to binarize all classifications into correctly
classified and wrongly classified (micro).
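In yardstick, these averaging schemes are selected via the `estimator` argument (a sketch; a predictions tibble `preds` with columns `truth` and `.pred_class` is an assumed layout):

```r
library(yardstick)

# The same metric under the three multiclass averaging schemes.
sens(preds, truth = truth, estimate = .pred_class, estimator = "macro")
sens(preds, truth = truth, estimate = .pred_class, estimator = "macro_weighted")
sens(preds, truth = truth, estimate = .pred_class, estimator = "micro")
```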
Confidence intervals are calculated from the t-distribution with 4 degrees of freedom.
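The intervals can be reproduced from the per-replicate estimates; with 4 degrees of freedom this corresponds to five replicates (`x` is the vector of replicate means):

```r
# mean +/- t_{0.975, n-1} * se, matching the CI_lower/CI_upper columns.
ci_from_t <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  tq <- qt(1 - (1 - level) / 2, df = n - 1)
  c(mean = mean(x), se = se,
    CI_lower = mean(x) - tq * se, CI_upper = mean(x) + tq * se)
}
```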
Default averaging

| .metric | .estimator | mean | sd | se | CI_lower | CI_upper |
|---|---|---|---|---|---|---|
| accuracy | multiclass | 0.858 | 0.014 | 0.006 | 0.841 | 0.875 |
| mcc | multiclass | 0.800 | 0.019 | 0.009 | 0.776 | 0.824 |
| kap | multiclass | 0.798 | 0.020 | 0.009 | 0.772 | 0.823 |
| sens | macro | 0.832 | 0.031 | 0.014 | 0.794 | 0.870 |
| spec | macro | 0.959 | 0.005 | 0.002 | 0.953 | 0.964 |
| j_index | macro | 0.791 | 0.034 | 0.015 | 0.749 | 0.833 |
| bal_accuracy | macro | 0.896 | 0.017 | 0.008 | 0.874 | 0.917 |
| precision | macro | 0.863 | 0.019 | 0.008 | 0.839 | 0.886 |
| f_meas | macro | 0.842 | 0.025 | 0.011 | 0.810 | 0.873 |
Weighted averaging

| .metric | .estimator | mean | sd | se | CI_lower | CI_upper |
|---|---|---|---|---|---|---|
| accuracy | multiclass | 0.858 | 0.014 | 0.006 | 0.841 | 0.875 |
| mcc | multiclass | 0.800 | 0.019 | 0.009 | 0.776 | 0.824 |
| kap | multiclass | 0.798 | 0.020 | 0.009 | 0.772 | 0.823 |
| sens | macro_weighted | 0.858 | 0.014 | 0.006 | 0.841 | 0.875 |
| spec | macro_weighted | 0.935 | 0.009 | 0.004 | 0.924 | 0.947 |
| j_index | macro_weighted | 0.794 | 0.023 | 0.010 | 0.766 | 0.821 |
| bal_accuracy | macro_weighted | 0.897 | 0.011 | 0.005 | 0.883 | 0.911 |
| precision | macro_weighted | 0.856 | 0.014 | 0.006 | 0.839 | 0.873 |
| f_meas | macro_weighted | 0.853 | 0.016 | 0.007 | 0.833 | 0.873 |
For multiclass scenarios, roc_auc and pr_auc need to be summarized. For
roc_auc, the Hand-Till procedure is used, which is insensitive to
class imbalances.
| .metric | .estimator | mean | sd | se | CI_lower | CI_upper |
|---|---|---|---|---|---|---|
| roc_auc | hand_till | 0.973 | 0.002 | 0.001 | 0.970 | 0.975 |
| pr_auc | macro | 0.931 | 0.006 | 0.003 | 0.923 | 0.938 |
| gain_capture | macro | 0.944 | 0.003 | 0.001 | 0.940 | 0.948 |
| mn_log_loss | multiclass | 0.584 | 0.008 | 0.003 | 0.574 | 0.593 |
For the roc_auc, pr_auc and
gain_capture measures, the individual curves can be
shown.
Much of the uncertainty of the model lies within the Intermediate_ADM and Intermediate_WT groups. Differentiating between these groups may be harder than differentiating between an Intermediate and a more differentiated (Alpha_ and Beta_like) phenotype.
To see how this affects model interpretation, all Intermediate levels are merged. Likewise, Alpha_ and Beta_like levels are merged.
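A sketch of the level merging using forcats; the exact level names are assumed from the group labels used throughout this report:

```r
library(forcats)

# Collapse the class labels into two super-groups (level names assumed).
merged <- fct_collapse(
  preds$truth,
  Intermediate    = c("Intermediate_ADM", "Intermediate_ADM_WT", "Intermediate_WT"),
  Alpha_Beta_like = c("Alpha_like", "Beta_like")
)
```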
Many of the metrics depend on the chosen test level. Therefore the results are shown for both directions.
Test level: Alpha_ + Beta_like

| .metric | .estimator | mean | sd | se | CI_lower | CI_upper |
|---|---|---|---|---|---|---|
| accuracy | binary | 0.937 | 0.005 | 0.002 | 0.930 | 0.943 |
| mcc | binary | 0.845 | 0.011 | 0.005 | 0.832 | 0.859 |
| kap | binary | 0.844 | 0.012 | 0.005 | 0.829 | 0.859 |
| sens | binary | 0.924 | 0.011 | 0.005 | 0.911 | 0.937 |
| spec | binary | 0.942 | 0.010 | 0.005 | 0.929 | 0.954 |
| j_index | binary | 0.865 | 0.007 | 0.003 | 0.857 | 0.874 |
| bal_accuracy | binary | 0.933 | 0.003 | 0.002 | 0.928 | 0.937 |
| precision | binary | 0.855 | 0.021 | 0.009 | 0.830 | 0.881 |
| f_meas | binary | 0.888 | 0.008 | 0.004 | 0.878 | 0.898 |
Test level: Intermediate

| .metric | .estimator | mean | sd | se | CI_lower | CI_upper |
|---|---|---|---|---|---|---|
| accuracy | binary | 0.937 | 0.005 | 0.002 | 0.930 | 0.943 |
| mcc | binary | 0.845 | 0.011 | 0.005 | 0.832 | 0.859 |
| kap | binary | 0.844 | 0.012 | 0.005 | 0.829 | 0.859 |
| sens | binary | 0.942 | 0.010 | 0.005 | 0.929 | 0.954 |
| spec | binary | 0.924 | 0.011 | 0.005 | 0.911 | 0.937 |
| j_index | binary | 0.865 | 0.007 | 0.003 | 0.857 | 0.874 |
| bal_accuracy | binary | 0.933 | 0.003 | 0.002 | 0.928 | 0.937 |
| precision | binary | 0.971 | 0.004 | 0.002 | 0.966 | 0.975 |
| f_meas | binary | 0.956 | 0.004 | 0.002 | 0.951 | 0.961 |
Because cross validation was run in replicates, it is possible to calculate how frequently samples are not assigned the same class across all replicates.
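This per-sample stability can be computed as a sketch with dplyr; a long-format predictions table `pred_long` with columns `sample_id`, `replicate` and `.pred_class` is an assumed layout:

```r
library(dplyr)

stability <- pred_long %>%
  group_by(sample_id) %>%
  summarise(
    n_classes = n_distinct(.pred_class),  # 1 = same class in every replicate
    stable    = n_classes == 1
  )
```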
* The majority of instability is within the Intermediate groups
* In particular, the separation between Intermediate_ADM and Intermediate_ADM_WT is less stable
* The second most frequent instability is between the Alpha_like and Intermediate classes
From each cross validation fold the top 25 most relevant probes are extracted.
The low stability of the probes is likely a consequence of the way probes are weighted in the random forest models.
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | multiclass | 0.808 |
| mcc | multiclass | 0.740 |
| kap | multiclass | 0.727 |
| sens | macro | 0.710 |
| spec | macro | 0.948 |
| j_index | macro | 0.657 |
| bal_accuracy | macro | 0.829 |
| precision | macro | 0.789 |
| f_meas | macro | 0.700 |
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.846 |
| mcc | binary | 0.435 |
| kap | binary | 0.416 |
| sens | binary | 0.952 |
| spec | binary | 0.400 |
| j_index | binary | 0.352 |
| bal_accuracy | binary | 0.676 |
| precision | binary | 0.870 |
| f_meas | binary | 0.909 |
This indicates that the class imbalance plays an important part in driving these measures. In the validation set only very few non-Intermediate samples are annotated.
For reliable prediction, the validation set needs to lie within the range of the training data. Data far outside the training range will likely result in incorrect predictions.
In particular, the validation data is based on EPIC methylation chips, while the training data mostly contains 450K samples.
To test the similarity the validation data is projected into the PCA space defined by the training data. The PCA is based on the DMPs used in consensus clustering.
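A sketch of the projection, assuming samples × DMPs matrices `train_betas` and `valid_betas` with identical probe columns:

```r
# PCA is fit on the training data only; validation samples are then
# rotated into the same space.
pca_train    <- prcomp(train_betas, center = TRUE, scale. = FALSE)
valid_scores <- predict(pca_train, newdata = valid_betas)
```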
PCA of training data
Projection of the validation set
The result of the individual consensus clustering does not
match well with the position of the samples in the PCA plot.
This is problematic because the consensus clustering is used to define
the ground truth. It is a large contributor to the reduced performance of
the model on the validation data.
In particular, the prediction of Alpha_ and Beta_like samples seems
off.
Comparison of predicted classes
Another method for defining the ground truth is required.
An initial random forest classifier was developed by Marco Visani. Training followed the procedure used by Capper et al. to construct the brain tumor classifier.
One of the main open questions was the low overlap between probes
used by the model and probes used for consensus clustering. Here,
training was repeated using the DMPs from consensus clustering.
Additionally, the model training procedure was switched to the
tidymodels workflow, which is more generalized and allows
easier switching between different types of classifiers.
Model performance is similar to or slightly higher than that of the
unsupervised model. Tuning of model parameters had little impact on
performance. This may be because none of the parameters was set to a
very low value, where performance is expected to degrade. Random forest
models often work well with the default parameters, and this is confirmed
here.
Also, the two different implementations of random forests, ranger
and randomForest, produced equivalent results.
When measuring model performance, the class imbalance
should be taken into account. Many metrics, such as the accuracy,
overestimate model performance in imbalanced cases. The likelihood-based
metrics roc_auc, pr_auc and
gain_capture show how the model performance differs
across classes. In particular, the Intermediate_ADM and
Intermediate_ADM_WT classes seem to be less well defined. The
mn_log_loss is lower compared to Marco’s model, indicating a
better fit.
Cross validation was performed in replicates and shows that the
prediction is stable in most cases. Some Intermediate and a few Alpha_like
cases show instability. This may be connected to an intrinsic ambiguity of
the class identities. The probes used by the model are often unique to a
particular replicate and therefore not very useful for extracting biological
information.
The weak link between probes and importance is due to the way probe
importances are assigned by the random forest model. If probes are
correlated, the importance scores will be distributed
randomly across all correlated probes. This most likely is also the
reason for the poor overlap between the probes in Marco’s model and
the DMPs.
It is possible to de-correlate the data prior to model fitting. Doing so removes about 1000 probes and slightly degrades model performance, which means that removing these probes also removes some of the information.
Applying the model to a validation set gives mixed results. The validation set is highly imbalanced, and as a result many metrics are heavily skewed. The overall accuracy is acceptable, but the mcc and kap values are quite low. Combining all Intermediate samples allows high recall, but this most likely is due to their high number.
Importantly, comparing the training and validation data makes apparent that the class assignments from the consensus clustering fit the PCA embedding quite poorly. Because the individual consensus clustering is used to define the ground truth, this poor fit degrades the performance of the model. Otherwise, the agreement between training and test data is quite good, with no indication that the array technology skews the methylation data.
There are several options to improve the model. The most important
one is the need to better define the ground truth in the validation set.
However, it may also be beneficial to revisit the original consensus
clustering. Potential replacements are graph-based clustering algorithms
such as Leiden or random walk.
Using the DMPs is itself a case of information leakage that may contribute to overly optimistic estimates in cross validation. A better approach would be to select probes from all of the data. This was done by Marco, but outside of cross validation. It has been shown that it is important to also cross-validate the feature selection steps. An improved model could do this either on the de-correlated probes or after running PCA to deal with the correlation in the methylation signal.
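With tidymodels, this leakage can be avoided by moving the filtering into a recipe, so it is re-estimated inside every cross-validation fold (a sketch; `train_df` and `epi_group` are assumed names, and the number of components is illustrative):

```r
library(recipes)

# Both steps are trained on the analysis portion of each fold only,
# so the feature selection is cross-validated together with the model.
rec <- recipe(epi_group ~ ., data = train_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.8) %>%  # de-correlate probes
  step_pca(all_numeric_predictors(), num_comp = 50)         # or PCA instead
```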